Abstract


Taking into account high-order interactions among covariates is valuable in many practical regression
problems. This is, however, a computationally challenging task because the number of high-order
interaction features to be considered is extremely large unless the number of covariates is sufficiently
small. In this paper, we propose a novel efficient algorithm for LASSO-based sparse learning of such
high-order interaction models. Our basic strategy for reducing the number of features is to employ the
idea of the recently proposed safe feature screening (SFS) rule. An SFS rule has the property that, if a
feature satisfies the rule, then the feature is guaranteed to be non-active in the LASSO solution, meaning
that it can be safely screened out prior to the LASSO training process. If a large number of features can
be screened out before training the LASSO, the computational cost and the memory requirement can
be dramatically reduced. However, applying such an SFS rule to each of the extremely large number
of high-order interaction features would be computationally infeasible. Our key idea for solving this
computational issue is to exploit the underlying tree structure among high-order interaction features.
Specifically, we introduce a pruning condition called the safe feature pruning (SFP) rule, which has the
property that, if the rule is satisfied at a certain node of the tree, then all the high-order interaction
features corresponding to its descendant nodes are guaranteed to be non-active at the optimal solution.
Our algorithm is extremely efficient, making it possible to work, e.g., with 3rd order interactions of 10,000
original covariates, where the number of possible high-order interaction features is greater than $10^{11}$.

Keywords: Machine Learning, Sparse Modeling, Safe Screening, High-Order Interaction Model 
1 Introduction


Sparse learning of high-dimensional models has been actively studied in the past decades [1]. Among many
approaches, LASSO [2] is one of the most widely used methods, and its statistical and computational
properties have been intensively investigated. The main task in LASSO training is to identify the set of
active features, whose coefficients turn out to be nonzero at the optimal solution. If we knew which
features would be active, the solution trained only with those active features would be guaranteed to be
optimal. This observation suggests so-called feature screening approaches, where we first screen out a
subset of features which would be non-active at the optimal solution, and then train a LASSO only with
the remaining features. The LASSO training can be highly efficient if a majority of non-active features can
be screened out a priori.

Existing screening approaches are categorized into two types. In the first type of approach, called
non-safe feature screening, a subset of features which are predicted to be non-active at the optimal
solution is first identified, and then a LASSO is trained after screening out those features. Since the
predictions might be incorrect, the obtained LASSO solution is used for checking whether all the
screened-out features are really non-active. Unless all of them are confirmed to be truly non-active, some
of those features must be brought back into the working feature set, and a LASSO is trained again with the
updated working feature set. In non-safe screening approaches, such a trial-and-error process must be
repeated until all the optimality conditions are satisfied.

Another type of approach, called safe feature screening (SFS), was recently introduced by El Ghaoui
et al. [3]. The advantage of SFS is that the screened-out features are guaranteed to be non-active at
the optimal solution, meaning that the iterative trial-and-error process is not necessary. The safe feature
screening approach has been receiving increasing attention in the literature, and several extensions have
been recently studied [?]. SFS is especially useful when the number of features is extremely large
and the entire data set cannot be stored in memory. Once a subset of features is screened out by SFS,
those features can be completely removed from memory because they will never be accessed during
the subsequent LASSO training process.

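In this paper, we study sparse learning problems for high-order interaction models. Let us denote the
original training set by $\{(z^i, y^i)\}_{i \in [n]}$, where $n$ is the number of training instances,
$z^i := [z^i_1, \ldots, z^i_d]^\top$ is the $d$-dimensional vector of original covariates, and $y^i$ is the
scalar response. In high-order interaction models up to order $r$, we have
$D = \sum_{\kappa \in [r]} \binom{d}{\kappa}$ features. Thus, the expanded $n \times D$ design matrix $X$
has the form:

$$X := [\, z_1 \ \cdots \ z_d \mid z_1 z_2 \ \cdots \ z_{d-1} z_d \mid \cdots \mid z_1 \cdots z_r \ \cdots \ z_{d-r+1} \cdots z_d \,] \in \mathbb{R}^{n \times D}, \qquad (1)$$

where $z_j$ denotes the $j$-th column of original covariates and products of columns are taken elementwise.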
Figure 1: The underlying tree structure among high-order interaction features (d = 4 and r = 3).


Then, we consider the LASSO problem

$$\beta^* := \arg\min_{\beta \in \mathbb{R}^D} \ \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1, \qquad (2)$$

where $y := [y^1, \ldots, y^n]^\top \in \mathbb{R}^n$ and $\lambda > 0$ is the regularization parameter which
balances the first (loss) term and the second (regularization) term. Unless the original input dimension $d$
is fairly small, the number of features $D$ is extremely large. For example, when $d = 10{,}000$ and $r = 3$,
we have $D > 10^{11}$ features. Although a variety of LASSO training algorithms have been proposed, it is
still computationally infeasible to solve such a high-dimensional LASSO problem directly. Furthermore, it
would also be difficult to load the entire (expanded) data set into memory, making it hard to use existing
LASSO solvers.
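To make the size of the expanded feature space concrete, the following minimal sketch counts the interaction features up to order $r$ as a sum of binomial coefficients. The function name `num_interaction_features` is ours and purely illustrative.

```python
from math import comb

def num_interaction_features(d: int, r: int) -> int:
    """Number of interaction features of order 1..r among d covariates."""
    return sum(comb(d, k) for k in range(1, r + 1))

D = num_interaction_features(10_000, 3)
print(D)             # 166666675000
print(D > 10**11)    # True
```

Even storing one byte per entry of such a design matrix with a few thousand rows would be far beyond ordinary memory, which is why the features are never materialized all at once.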

Our basic strategy is to develop an SFS method particularly suited for high-order interaction models, and
to screen out the majority of non-active features before actually training the model by the LASSO.
Unfortunately, no existing SFS method can be used directly for our problem because it is impossible to
evaluate an SFS rule for each of the extremely large number $D$ of features. Our key idea for overcoming
this computational difficulty is to exploit the underlying tree structure among high-order interaction
features as depicted in Figure 1. We propose what we call safe feature pruning (SFP), a novel safe feature
screening method for high-order interaction models. Our SFP rule is a condition defined at each node of the
tree such that, if the condition is satisfied at a certain node, then all the high-order interaction features
corresponding to its descendant nodes are guaranteed to be non-active at the optimal solution. It allows us
to safely screen out a large number of high-order interaction features.

2 LASSO for high-order interaction models 

2.1 Problem setup 

In this paper we study sparse learning of high-order interaction models. Throughout the paper, we assume
that $z^i_j$, $(i, j) \in [n] \times [d]$, is standardized to $[0,1]$. As mentioned in the previous section,
the training set in the original covariate domain is denoted as $\{(z^i, y^i)\}_{i \in [n]}$, where
$z^i \in [0,1]^d$ and $y^i \in \mathbb{R}$. In addition, we also denote the training set in the expanded
feature domain as $\{(x^i, y^i)\}_{i \in [n]}$, where $x^i \in [0,1]^D$ is the expanded feature vector of the
$i$-th training instance, i.e., the $i$-th row of the design matrix $X$ in (1). Note that each high-order
interaction feature $x^i_j$, $(i, j) \in [n] \times [D]$, is also defined in $[0,1]$. Furthermore, we denote
the $j$-th column of the wide design matrix $X$ as $x_j$, $j \in [D]$. The problem we consider here is to
find a sparse solution $\beta^* = [\beta^*_1, \ldots, \beta^*_D]^\top$ by solving the LASSO problem in (2).


2.2 Preliminaries and basic idea 

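When the penalty parameter $\lambda$ is sufficiently large, only a small portion of the $D$ coefficients
$\{\beta^*_j\}_{j \in [D]}$ would be non-zero. We denote the index set of the active features as

$$\mathcal{A}^* := \{j \in [D] : \beta^*_j \neq 0\}.$$

In the convex optimization literature, it is well known that the optimal solution does not depend on
non-active variables, which is formally stated as follows.

Lemma 1. Let $\mathcal{A}$ be an index set such that $\mathcal{A}^* \subseteq \mathcal{A} \subseteq [D]$. Then,
the solution of the LASSO problem (2) is given as

$$\beta^*_{\mathcal{A}} = \arg\min_{\beta \in \mathbb{R}^{|\mathcal{A}|}} \ \frac{1}{2}\|y - X_{\mathcal{A}}\beta\|_2^2 + \lambda\|\beta\|_1, \qquad \beta^*_{\bar{\mathcal{A}}} = 0, \qquad (3)$$

where $\beta^*_{\mathcal{A}}$ and $\beta^*_{\bar{\mathcal{A}}}$ are the subvectors of $\beta^*$ with the
components in $\mathcal{A}$ and $\bar{\mathcal{A}} := [D] \setminus \mathcal{A}$, respectively, and
$X_{\mathcal{A}} \in [0,1]^{n \times |\mathcal{A}|}$ is the submatrix of $X$ which only has the columns
indexed by $\mathcal{A}$.

Lemma 1 indicates that, if we have an index set $\mathcal{A} \supseteq \mathcal{A}^*$, the optimal solution
$\beta^*$ in (2) can be efficiently obtained by solving a smaller optimization problem that does not depend
on all the $D$ features but only on the subset of features in $\mathcal{A}$.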
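The statement of Lemma 1 can be checked numerically on a small problem. The sketch below is illustrative only: it uses a plain cyclic coordinate descent solver written for this example (`lasso_cd` is our own helper, not the solver used in the experiments), random features in $[0,1]$, and an arbitrary superset of the active set.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=1000):
    """Cyclic coordinate descent for 0.5*||y - X b||_2^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()            # invariant: resid = y - X @ beta
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

rng = np.random.default_rng(0)
X = rng.random((50, 20))                      # features in [0, 1]
y = rng.standard_normal(50)
lam = 0.5 * np.max(np.abs(X.T @ y))

beta_full = lasso_cd(X, y, lam)               # solve with all features
active = np.flatnonzero(beta_full)
superset = np.union1d(active, [0, 1])         # any index set containing the active set
beta_sub = np.zeros(X.shape[1])
beta_sub[superset] = lasso_cd(X[:, superset], y, lam)

print(np.max(np.abs(beta_full - beta_sub)))   # numerically zero
```

The restricted problem touches only the columns in the superset, which is exactly what makes screening useful when the full design matrix cannot be stored.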


Safe screening. In order to find an index set $\mathcal{A} \supseteq \mathcal{A}^*$, we employ a recently
introduced technique called safe feature screening (SFS) [3]. SFS enables us to find a subset of non-active
features without actually solving the optimization problem. Roughly speaking, the SFS algorithm for the
LASSO problem is based on its primal-dual relationship. The dual problem of the LASSO problem in (2) is
written as

$$\min_{\{\theta_i\}_{i \in [n]}} \ \frac{1}{2} \sum_{i \in [n]} \Big(\theta_i - \frac{y^i}{\lambda}\Big)^2 \quad \text{subject to} \quad \Big|\sum_{i \in [n]} x^i_j \theta_i\Big| \le 1 \ \ \forall j \in [D], \qquad (4)$$

where $\{\theta_i\}_{i \in [n]}$ are the dual variables. Then, using standard convex optimization theory
(e.g., see [7]), we have the following lemma (see [3] for the proof).

Lemma 2. Let $\{\theta^*_i\}_{i \in [n]}$ be the optimal dual solution of the LASSO dual problem in (4). Then,

$$\Big|\sum_{i \in [n]} x^i_j \theta^*_i\Big| < 1 \ \Rightarrow \ \beta^*_j = 0, \quad j \in [D]. \qquad (5)$$
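Lemma 2 can be illustrated in a setting where both the primal and the dual solutions are available in closed form. The sketch below assumes a design with orthonormal columns (an assumption made only for this example, so that the LASSO solution is given by soft-thresholding) and uses the standard primal-dual relation $\theta^* = (y - X\beta^*)/\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((30, 8)))    # design with orthonormal columns
y = rng.standard_normal(30)
lam = 0.6 * np.max(np.abs(Q.T @ y))

corr = Q.T @ y
beta = np.sign(corr) * np.maximum(np.abs(corr) - lam, 0.0)   # closed-form LASSO solution
theta = (y - Q @ beta) / lam                                 # optimal dual point
score = np.abs(Q.T @ theta)                                  # |x_j^T theta*| for each j

inactive_by_rule = score < 1 - 1e-12          # strict inequality, with a float tolerance
print(np.all(beta[inactive_by_rule] == 0.0))  # True
```

Active features sit exactly on the boundary $|x_j^\top \theta^*| = 1$, so any certified upper bound below 1 on this quantity is enough to discard a feature.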
The key idea of SFS is to efficiently compute, without actually solving the dual problem, an upper bound
$U_j$ for each $j \in [D]$ such that

$$U_j \ge \Big|\sum_{i \in [n]} x^i_j \theta^*_i\Big|. \qquad (6)$$

Lemma 2 indicates

$$U_j < 1 \ \Rightarrow \ \beta^*_j = 0,$$

meaning that a non-active coefficient $\beta^*_j$ can be identified before solving the optimization problem.
After the seminal work of El Ghaoui et al. [3], several approaches for efficiently computing such an upper
bound have been proposed [?]. In this paper, we particularly use the idea of using a variational inequality
for computing upper bounds, which was recently proposed by Liu et al. [5].

Safe screening has been used when a sequence of LASSO solutions for various values of $\lambda$ is computed
(cf. regularization path [?]). When we compute a sequence of solutions, we start from $\lambda_{\max}$, where
the LASSO solution for any $\lambda \ge \lambda_{\max}$ is $\beta^* = 0$. Then, we compute a sequence of LASSO
solutions for a decreasing sequence of $\lambda$ by using the previous solution as the initial warm-start
solution. The upper bounds $\{U_j\}_{j \in [D]}$ in (6) can be constructed by using the optimal solution of
the LASSO for another regularization parameter $\lambda' > \lambda$, which has already been obtained in the
above context.


Tree structure and pruning rule. Unfortunately, SFS alone is not sufficient for our problem because it
is intractable to compute an upper bound $U_j$ for each of the exponentially large number of features. For
handling such an extremely large number of features in the expanded feature domain, we exploit the
underlying tree structure. We consider a simple tree structure as depicted in Figure 1. We denote each node
of the tree by an index $j \in [D]$. For any node $j$ in the tree, let $De(j)$ be the set of its descendant
nodes. Our main contribution in this paper is to develop a novel SFS method particularly designed for
fitting sparse high-order interaction models. Specifically, at each node of the tree, we derive a condition
called the safe feature pruning (SFP) rule. Our SFP rule has the following property:

$$\text{SFP rule for a node } j \text{ is satisfied} \ \Rightarrow \ \Big|\sum_{i \in [n]} x^i_{j'} \theta^*_i\Big| < 1 \ \text{ for all } j' \in De(j). \qquad (7)$$


This property indicates that, if the SFP rule in a certain node j of the tree is satisfied, then we can guarantee 
that all the high-order interaction terms corresponding to its descendant nodes can be safely screened-out. 
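The tree over interaction features can be sketched as an enumeration tree of itemsets. The layout below is an assumption made for illustration: each node is a tuple of covariate indices, and its children are obtained by appending one index larger than the last one. With $d = 4$ and $r = 3$ this gives one node per interaction feature.

```python
def children(node, d, r):
    """Children of an itemset node: append one covariate index larger than the last."""
    if len(node) >= r:
        return []
    start = node[-1] + 1 if node else 1
    return [node + (k,) for k in range(start, d + 1)]

def descendants(node, d, r):
    """All nodes strictly below `node`, in depth-first order."""
    out = []
    for child in children(node, d, r):
        out.append(child)
        out.extend(descendants(child, d, r))
    return out

d, r = 4, 3
all_nodes = descendants((), d, r)       # the empty tuple plays the role of the root
print(len(all_nodes))                   # 14
print(descendants((1, 2), d, r))        # [(1, 2, 3), (1, 2, 4)]
```

Every descendant of a node contains the node's covariates, so a single test at a node can speak for its entire subtree.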


2.3 Related works 

Before presenting our main contribution, let us briefly review related works in the literature. Fitting
high-order interaction models has long been desired in many regression problems. In biomedical studies, for
example, many complex diseases such as cancer are known to be the consequences of high-order interaction
effects of multiple genetic factors [?]. In the past decade, several authors proposed extensions of
the LASSO for incorporating interaction effects, and studied their statistical properties [?]. However,
none of these works has a sufficient computational mechanism for handling an exponentially large number
of high-order interaction features. Most of these works thus focus only on 2nd order interaction features
for a moderate number of original covariates $d$. One commonly used heuristic for reducing the number of
interaction features is to introduce the so-called strong heredity assumption [?], where, e.g., an
interaction term $z_1 z_2$ would be selected only when both $z_1$ and $z_2$ are selected. However, such a
heuristic assumption eliminates the chance to find novel high-order interaction features when there are no
strong associations in their marginal main effects. To the best of our knowledge, the only exception that
can be applied to high-order interaction modeling for sufficiently large data sets is the itemset boosting
(IB) algorithm presented in [17]. The IB algorithm is a boosting-type algorithm, where a single feature is
added and the model is updated in each step. In a nutshell, the IB algorithm manages its working feature set
by exploiting the underlying tree structure as we do. However, since the working feature set in the IB
algorithm is non-safe (no guarantee to be active or non-active), the aforementioned trial-and-error process
is necessary. We compare our approach with the IB algorithm in Section 4 and demonstrate that the former is
computationally more efficient than the latter. Another line of related studies concerns statistical issues
such as feature selection consistency [?] for high-order interaction models. We have been studying, in [?],
how to apply the post-selection inference framework recently introduced in [?] to statistical inference on
high-order interaction models.

3 Safe feature pruning for high-order interaction models 

In this section we present our main result. The proposed safe feature pruning (SFP) is used when we
compute a sequence of LASSO solutions for a decreasing sequence of the regularization parameter $\lambda$.
Let $\lambda_{\max} > \lambda_1 > \ldots > \lambda_T$ be a sequence of regularization parameter values at each
of which we want to compute the LASSO solution, where $\lambda_{\max} := \max_{j \in [D]} |x_j^\top y|$.^1
Furthermore, we denote the optimal LASSO solutions for this sequence of regularization parameters as
$\beta^*(\lambda_t) := [\beta^*_1(\lambda_t), \ldots, \beta^*_D(\lambda_t)]^\top \in \mathbb{R}^D$ for $t \in [T]$.

The outline of the algorithm for computing the sequence of solutions is summarized in Algorithm 1. In
line 1, we initialize $\lambda_0$ and $\beta^*(\lambda_0)$ by $\lambda_{\max}$ and its corresponding solution
$\beta^*(\lambda_{\max}) = 0$. Line 3 is the core of our algorithm, where we find a superset
$\hat{\mathcal{A}}(\lambda_t)$ of $\mathcal{A}^*(\lambda_t) := \{j \in [D] : \beta^*_j(\lambda_t) \neq 0\}$ by the
proposed SFP method (see Theorem 3). In line 4, the LASSO problem is solved for obtaining
$\beta^*(\lambda_t)$ by using any LASSO solver only with the set of features in $\hat{\mathcal{A}}(\lambda_t)$.
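The value $\lambda_{\max}$ itself can be found by a pruned tree search, using the fact that any descendant feature is elementwise no larger than its ancestor. The sketch below is illustrative: the function name, the zero-based itemset tree, and the random binary data are our own choices, and the result is compared against a brute-force enumeration.

```python
import numpy as np
from itertools import combinations

def lambda_max_pruned(Z, y, r):
    """Max of |x_j^T y| over all interaction features of order <= r (columns of Z in [0,1])."""
    n, d = Z.shape
    pos, neg = np.maximum(y, 0.0), np.minimum(y, 0.0)
    best, visited = 0.0, 0

    def visit(x, last, depth):
        nonlocal best, visited
        visited += 1
        best = max(best, abs(x @ y))
        # Valid for every descendant x' with 0 <= x' <= x elementwise.
        bound = max(x @ pos, -(x @ neg))
        if depth == r or bound <= best:
            return                        # no descendant can beat the incumbent
        for k in range(last + 1, d):
            visit(x * Z[:, k], k, depth + 1)

    for k in range(d):
        visit(Z[:, k].astype(float), k, 1)
    return best, visited

rng = np.random.default_rng(0)
Z = (rng.random((40, 12)) < 0.3).astype(float)
y = rng.standard_normal(40)
best, visited = lambda_max_pruned(Z, y, r=3)
brute = max(abs(np.prod(Z[:, list(c)], axis=1) @ y)
            for k in (1, 2, 3) for c in combinations(range(12), k))
print(np.isclose(best, brute), visited)   # visited is at most 298, the total number of nodes
```

The bound splits the inner product into its positive and negative parts, each of which can only shrink in magnitude as more covariates are multiplied in.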

^1 The largest regularization parameter $\lambda_{\max}$ can itself be computed efficiently by exploiting the
tree structure among features. Specifically, for any node $j$ and an arbitrary descendant node $j'$, by
noting that $0 \le x^i_{j'} \le x^i_j$ for any $i \in [n]$, we have

$$|x_{j'}^\top y| \le \max\Big\{\sum_{i \in [n]: y^i > 0} x^i_{j'} y^i, \ \Big|\sum_{i \in [n]: y^i < 0} x^i_{j'} y^i\Big|\Big\} \le \max\Big\{\sum_{i \in [n]: y^i > 0} x^i_j y^i, \ \Big|\sum_{i \in [n]: y^i < 0} x^i_j y^i\Big|\Big\}.$$

Using this relationship, we can efficiently find the maximum $|x_j^\top y|$ by searching over the tree with
pruning.

Algorithm 1 An outline for computing a sequence of LASSO solutions with safe feature pruning
Input: $\{(z^i, y^i)\}_{i \in [n]}$, $r$, $\{\lambda_t\}_{t \in [T]}$
1: $\lambda_0 \leftarrow \max_{j \in [D]} |x_j^\top y|$ and $\beta^*(\lambda_0) \leftarrow 0$
2: for $t = 1, \ldots, T$ do
3:     Find $\hat{\mathcal{A}}(\lambda_t) \supseteq \mathcal{A}^*(\lambda_t)$ by the safe feature pruning (SFP) method with $(\lambda_{t-1}, \beta^*(\lambda_{t-1}))$
4:     Compute the LASSO solution $\beta^*(\lambda_t)$ by using only the subset of features in $\hat{\mathcal{A}}(\lambda_t)$
5: end for
Output: $\{\beta^*(\lambda_t)\}_{t \in [T]}$

The following theorem is used in line 3 of Algorithm 1. Given the optimal LASSO solution
$\beta^*(\lambda_{t-1})$ for a regularization parameter $\lambda_{t-1}$, for any $\lambda_t \in (0, \lambda_{t-1})$,
we can develop an SFP rule such that, if the rule is satisfied at a node $j$, then all the features
corresponding to its descendant nodes $De(j)$ are guaranteed to be non-active at the optimal LASSO solution
for the regularization parameter $\lambda_t$.


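Theorem 3 (safe feature pruning with $(\lambda_{t-1}, \beta^*(\lambda_{t-1}))$). Suppose that the optimal LASSO
solution $\beta^*(\lambda_{t-1})$ is available for a regularization parameter $\lambda_{t-1}$. In addition, for
any $\lambda_t \in (0, \lambda_{t-1})$, define

$$P_1 := \frac{1}{2}\Big(\|x_j\|_2 \|b\|_2 + \sum_{i: c_i > 0} c_i x^i_j\Big), \qquad P_2 := \frac{1}{2}\Big(\|x_j\|_2 \Big\|b - \frac{b^\top a}{\|a\|_2^2} a\Big\|_2 + \sum_{i: d_i > 0} d_i x^i_j\Big),$$

$$M_1 := \frac{1}{2}\Big(\|x_j\|_2 \|b\|_2 - \sum_{i: c_i < 0} c_i x^i_j\Big), \qquad M_2 := \frac{1}{2}\Big(\|x_j\|_2 \Big\|b - \frac{b^\top a}{\|a\|_2^2} a\Big\|_2 - \sum_{i: d_i < 0} d_i x^i_j\Big),$$

where

$$a := \frac{1}{\lambda_{t-1}} X \beta^*(\lambda_{t-1}), \quad b := \Big(\frac{1}{\lambda_t} - \frac{1}{\lambda_{t-1}}\Big) y + a, \quad c := \Big(\frac{1}{\lambda_t} + \frac{1}{\lambda_{t-1}}\Big) y - a, \quad d := c - \frac{b^\top a}{\|a\|_2^2} a.$$

Then, for any node $j$ in the tree such that $x_j \neq 0$,

$$\max\{U_j^+, U_j^-\} < 1 \ \Rightarrow \ \beta^*_{j'}(\lambda_t) = 0 \ \text{ for all } j' \in De(j), \qquad (8)$$

where, when $\lambda_{t-1} = \lambda_{\max} = \max_{j \in [D]} |x_j^\top y|$,

$$U_j^+ := P_1, \qquad U_j^- := M_1,$$

while, when $\lambda_{t-1} < \lambda_{\max}$,

$$U_j^+ := \max\{P_1, P_2\}, \qquad U_j^- := \max\{M_1, M_2\}.$$

Furthermore, when the original covariates satisfy $z^i_j \in \{0, 1\}$ for $i \in [n]$ and $j \in [d]$, then

$$U_j^+ := \begin{cases} P_2 & \text{if } \sum_{i \in [n]: a_i < 0} a_i x^i_j \ \cdots \\ \max\{P_1, P_2\} & \text{otherwise,} \end{cases} \qquad U_j^- := \begin{cases} M_2 & \text{if } \sum_{i \in [n]: a_i > 0} a_i x^i_j < \cdots \\ \max\{M_1, M_2\} & \text{otherwise.} \end{cases}$$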


The proof of Theorem 3 is presented in Appendix A. In the depth-first search over the tree, if
we encounter a node with $x_j = 0$, then we can, of course, guarantee that its descendant nodes are
non-active. Note that the SFP condition in (8) can be computed by using the sparse previous solution
$\beta^*(\lambda_{t-1})$ and information available at the node $j$. It means that, if the SFP rule is satisfied
at a certain node $j$ in the tree, we can stop searching below that node, and all its descendant nodes can
be screened out.
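The depth-first search with pruning can be sketched for the simplest case of Theorem 3, $\lambda_{t-1} = \lambda_{\max}$, where $a = 0$ and the rule uses only $P_1$ and $M_1$. This is a minimal illustration under our own assumptions (function name, zero-based itemset tree, random binary data), not the authors' implementation. For each surviving node it also evaluates the per-feature bound $\frac{1}{2}(|x_j^\top c| + \|x_j\|_2 \|b\|_2)$, which is what the bounds of Proposition 4 in Appendix A reduce to when $a = 0$.

```python
import numpy as np
from itertools import combinations

def sfp_screen(Z, y, r, lam_prev, lam_t):
    """SFP depth-first search, assuming lam_prev == lam_max (beta*(lam_prev) = 0, a = 0)."""
    n, d = Z.shape
    b = (1.0 / lam_t - 1.0 / lam_prev) * y
    c = (1.0 / lam_t + 1.0 / lam_prev) * y
    b_norm = np.linalg.norm(b)
    c_pos, c_neg = np.maximum(c, 0.0), np.minimum(c, 0.0)
    kept, visited = [], 0

    def visit(items, x):
        nonlocal visited
        visited += 1
        x_norm = np.linalg.norm(x)
        if x_norm == 0.0:
            return                                    # every descendant is zero as well
        P1 = 0.5 * (x_norm * b_norm + x @ c_pos)
        M1 = 0.5 * (x_norm * b_norm - x @ c_neg)
        if max(P1, M1) < 1.0:
            return        # SFP rule: prune the subtree (P1, M1 also dominate the node's own bound)
        u = 0.5 * (abs(x @ c) + x_norm * b_norm)      # per-feature bound in the a = 0 case
        if u >= 1.0:
            kept.append(items)                        # not certified non-active: keep it
        if len(items) < r:
            for k in range(items[-1] + 1, d):
                visit(items + (k,), x * Z[:, k])

    for k in range(d):
        visit((k,), Z[:, k].astype(float))
    return kept, visited

rng = np.random.default_rng(1)
Z = (rng.random((60, 10)) < 0.2).astype(float)
y = rng.standard_normal(60)
feats = [f for k in (1, 2, 3) for f in combinations(range(10), k)]
col = lambda f: np.prod(Z[:, list(f)], axis=1)
lam_max = max(abs(col(f) @ y) for f in feats)
kept, visited = sfp_screen(Z, y, 3, lam_max, 0.8 * lam_max)
print(len(kept), visited, len(feats))
```

The features kept by the pruned search coincide with those that survive the per-feature test applied to every node, while the search visits only part of the tree.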

Our pruning approach relies on the fact that the original covariates are defined in $[0,1]$. For example,
it is easy to note that

$$x^i_{j'} \le x^i_j \quad \forall i \in [n], \ \forall (j, j') \in [D]^2 \text{ such that } j' \in De(j).$$

Such monotonically diminishing properties of the high-order interaction features indicate that higher-order
interaction features, which correspond to deep nodes in the tree, are more likely to be non-active than those
corresponding to shallow nodes. For example, if the original covariates are defined in the binary domain,
i.e., $z^i_j \in \{0, 1\}$, then the features become sparser as we consider higher-order interactions. As we
see in the following experiment section, when the original covariates are sparse, our pruning approach works
quite well.

For covariates defined in binary domain, where values 1 and 0 respectively indicate the existence and the 
non-existence of a certain property, it is easy to interpret interaction effects because they simply indicate 
co-existence of multiple properties. On the other hand, for continuous covariates, the interpretation of an 
interaction effect would depend on its coding. If each covariate is defined in [0,1] domain, and the value 
represents the “degree” of an existence of a certain property, then an interaction effect can be similarly 
understood as the “degree” of co-existence of multiple properties. 

4 Experiments 

In this section, we demonstrate the effectiveness of the proposed safe feature pruning (SFP) approach through 
numerical experiments. 

4.1 Experimental setup 

In the experiments, we computed a sequence of LASSO solutions at a decreasing sequence of regularization
parameters $\lambda$. Specifically, we started from $\lambda_0 = \lambda_{\max} := \max_{j \in [D]} |x_j^\top y|$,
and considered a sequence $\lambda_t = (1 - 0.1/\sqrt{t})\lambda_{t-1}$ for $t = 1, 2, \ldots$ until
$\lambda_t/\lambda_{\max} < 0.01$. We considered interaction models up to $r = 3$rd order. As the LASSO solver,
we used the shooting algorithm. All the codes were implemented by ourselves in C++, and all the experiments
were conducted on an HP workstation Z800 (Xeon(R) CPU X5675 (3.07GHz), 48GB MEM).
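The regularization schedule described above can be generated as follows. This is a small sketch with a hypothetical helper name; whether the last value, the first one that falls below the stopping ratio, is itself solved for is not stated in the text, and it is included here by assumption.

```python
import math

def lambda_sequence(lam_max, ratio_stop=0.01, step=0.1):
    """lambda_t = (1 - step / sqrt(t)) * lambda_{t-1}, until lambda_t / lam_max < ratio_stop."""
    lams = []
    lam, t = lam_max, 1
    while True:
        lam = (1.0 - step / math.sqrt(t)) * lam
        lams.append(lam)
        if lam / lam_max < ratio_stop:
            break
        t += 1
    return lams

lams = lambda_sequence(1.0)
print(len(lams), lams[0], lams[-1])
```

The multiplicative step shrinks as $t$ grows, so the grid becomes finer at small $\lambda$, where many features become active.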
Table 1: Computation time in seconds for computing a sequence of LASSO solutions.

sparsity (%)    without SFP    with SFP    pruning rate (%)
95              12811.66       170.93      99.63
90              28816.94       417.07      99.42
85              52343.61       1248.41     98.47
80              80224.89       4460.59     95.63


4.2 Synthetic data experiments 

First, we investigated the advantage of the proposed pruning scheme by comparing the computational costs
with and without safe feature pruning (SFP) on small synthetic data sets. The synthetic data were
generated from

$$y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I),$$

where $y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times D}$ is the input design matrix,
$\beta \in \mathbb{R}^D$ is the coefficient vector, and $\varepsilon \in \mathbb{R}^n$ is the Gaussian noise
vector. Here, we did not actually compute the extremely wide design matrix $X$ because it has an
exponentially large number of columns. Instead, we generated a random binary matrix
$Z \in \{0, 1\}^{n \times d}$, and each expanded high-order interaction feature $x_j$ was generated from $Z$
only when it was needed. For simplicity and computational efficiency, we assumed that the covariates (hence
the interaction features as well) are binary, and the sparsity (the fraction of 0s in the entries of $Z$)
was varied among 95%, 90%, 85%, and 80% to see how sparsity can be exploited for efficient computation. We
set $\beta = 0$, $\sigma = 0.1$, $n = 1{,}000$ and $d = 1{,}000$. The total computation time in seconds and
the average pruning rates are shown in Table 1. The results indicate that SFP reduces the computation time
by a factor of roughly 18 (at 80% sparsity) to 75 (at 95% sparsity). In other words, as the data set gets
sparser, the advantage of SFP increases.
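The synthetic setup, in which interaction columns are built on demand rather than stored, can be sketched as below. The helper names and the seed are our own; the settings follow the text ($\beta = 0$, so the response is pure Gaussian noise).

```python
import numpy as np

def make_synthetic(n, d, sparsity, sigma, seed=0):
    """Random binary covariates with the given fraction of zeros, and a noise-only response."""
    rng = np.random.default_rng(seed)
    Z = (rng.random((n, d)) >= sparsity).astype(np.uint8)   # P(entry = 0) = sparsity
    y = sigma * rng.standard_normal(n)                      # beta = 0, so y = noise
    return Z, y

def interaction_feature(Z, items):
    """Column x_j for an itemset of covariate indices, built on demand from Z."""
    return Z[:, list(items)].all(axis=1).astype(np.uint8)

Z, y = make_synthetic(n=1000, d=1000, sparsity=0.95, sigma=0.1)
x = interaction_feature(Z, (0, 1, 2))
print(Z.mean(), int(x.sum()))
```

With 95% sparsity a 3rd order feature is nonzero only when three rare events coincide, so most deep nodes are all-zero columns and are pruned immediately.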

4.3 Benchmark data experiments 

Next, we compared the proposed SFP approach with the itemset boosting (IB) algorithm presented in [17]. The
IB algorithm is a variant of the working set method, where a set of working features and a LASSO solution
trained only with the working feature set are maintained in each step. The IB algorithm updates the working
set by adding the feature that most violates the current optimality condition. The core of the IB algorithm
is that it can efficiently find the most violating feature by exploiting the underlying tree structure and
the anti-monotonicity property. Since there is no guarantee in the IB algorithm that all the features
outside the working set are truly non-active, one must repeat the trial-and-error process until all the
optimality conditions are satisfied.

We used seven benchmark datasets in the libsvm dataset repository [35], as listed in Table 2. In each
dataset, we restricted the maximum numbers of instances $n$ and covariates $d$ to 10,000. Although some of
these datasets are for binary classification, we regarded the responses $\{y^i\}_{i \in [n]}$ as real
variables, and standardized them to have mean zero and variance one. As we discussed in Section 3, we only
considered binary original covariates. For a continuous covariate in the original data set, we first
standardized it to have mean zero and variance one, and then represented the covariate by two binary
variables, which indicate whether the value is greater than $\delta$ and whether the value is smaller than
$-\delta$, respectively, where $\delta \in \{1.5, 2.0\}$.

Table 2: Total computation time (seconds) for computing a sequence of LASSO solutions in high-order
interaction models for various benchmark datasets.

                                   delta = 1.5              delta = 2.0
Data          n         d          IB          SFP          IB          SFP
usps          7,291     256        10514.81    4506.82      1373.21     1907.21
madelon       2,000     500        6685.99     1865.42      982.69      358.08
protein       10,000    357        33320.01    1975.00      20593.55    1297.30
mnist         10,000    780        47985.89    10047.49     7418.59     4009.85
rcv1.binary   10,000    10,000     1707.81     734.60       2893.67     826.56
real-sim      10,000    10,000     11122.02    3583.31      15101.93    3473.46
news20        10,000    10,000     > 100K      34988.51     > 100K      28568.94

The computation times in seconds are shown in Table 2. We see in the table that the proposed SFP is
faster than the IB algorithm in almost all cases (the exception is usps with $\delta = 2.0$). Figure 2 shows
the results on the usps, protein and mnist data sets (the same figures for the other datasets are presented
in Appendix B). For each data set, (a) the computation time in seconds, (b) the number of traversed nodes,
and (c) the number of active features are plotted for each regularization parameter value. We first see in
(a) and (b) that the computation time is roughly proportional to the number of traversed nodes. Comparing
SFP and the IB algorithm in (a) and (b), the former is faster than the latter, especially when $\lambda$ is
small. Furthermore, the computational cost of the IB algorithm in (a) and (b) seems to be positively
correlated with the number of active features in (c). A possible explanation of these observations is that,
when $\lambda$ is small, the IB algorithm must repeatedly search over the tree to find which feature should
come into the working set, since many active features newly enter the working set. On the other hand, the
computational cost of the proposed SFP did not increase as much as that of the IB algorithm because the SFP
approach can screen out a large number of features at the same time. The plots in (c) suggest that, when
$\lambda$ is small, many high-order interaction features (2nd and 3rd order interaction features are shown
in green and blue, respectively) become active, indicating the potential advantage of considering
high-order interaction features.
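The two-indicator encoding of continuous covariates used for the benchmark data can be sketched as follows. The helper name `binarize` is ours, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import numpy as np

def binarize(column, delta):
    """Standardize a continuous covariate, then emit indicators for > delta and < -delta."""
    column = np.asarray(column, dtype=float)
    std = column.std()                     # population standard deviation (assumption)
    s = (column - column.mean()) / std if std > 0 else np.zeros_like(column)
    high = (s > delta).astype(np.uint8)
    low = (s < -delta).astype(np.uint8)
    return high, low

values = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 10], dtype=float)
high, low = binarize(values, delta=1.5)
print(high.tolist(), low.tolist())   # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

A larger $\delta$ marks fewer entries as 1 and therefore yields sparser binary covariates, which is the regime where pruning is most effective.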

Figure 2: Results on the usps (left), protein (center), and mnist (right) data sets. Main effect features
are shown in red, 2nd order interaction features in green, and 3rd order interaction features in blue.

5 Conclusion

We proposed a safe feature screening rule called safe feature pruning (SFP) for high-order interaction
models. A key advantage of SFP is that, by exploiting the underlying tree structure among high-order
interaction features and its anti-monotonicity property, a large number of high-order interaction features
can be simultaneously screened out by evaluating a few simple conditions on the tree. As long as the
original covariates are sufficiently sparse, our algorithm can be used for high-order interaction models with
the number of features $D > 10^{11}$. Our future work is to extend this idea to classification setups.

A Proof of Theorem 3

In this section we prove Theorem 3. To prove the theorem, we use a recent result on safe feature screening
developed in [5]. Our technical contribution in the following proof is in bounding the screening condition of
a feature at a node based only on the information available at its ancestor node, which is the crucial
property of the proposed safe feature pruning.

Although the notations and formulations are different, the following proposition is essentially identical
to the recent result in [5].


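Proposition 4 (Liu et al. [5]). Consider a pair of regularization parameters $\lambda_1$ and $\lambda_2$ for
the LASSO problem in (2) such that $\max_{j \in [D]} |x_j^\top y| \ge \lambda_1 > \lambda_2 > 0$, and denote the
optimal dual variable vectors for these two problems as $\theta_1^*$ and $\theta_2^*$, respectively. In
addition, define

$$a := \frac{y}{\lambda_1} - \theta_1^*, \qquad b := \frac{y}{\lambda_2} - \theta_1^*, \qquad c := \frac{y}{\lambda_2} + \theta_1^*.$$

Furthermore, for each $j \in [D]$, define

$$u_j^+ := \begin{cases} \frac{1}{2}\big(x_j^\top c + \|x_j\|_2 \|b\|_2\big) & \text{if } \frac{b^\top a}{\|b\|_2} \le -\frac{x_j^\top a}{\|x_j\|_2}, \\ \frac{1}{2}\Big(x_j^\top c + \Big\|x_j - \frac{x_j^\top a}{\|a\|_2^2} a\Big\|_2 \Big\|b - \frac{b^\top a}{\|a\|_2^2} a\Big\|_2 - \frac{(a^\top b)(x_j^\top a)}{\|a\|_2^2}\Big) & \text{otherwise,} \end{cases} \qquad (9)$$

$$u_j^- := \begin{cases} \frac{1}{2}\big(-x_j^\top c + \|x_j\|_2 \|b\|_2\big) & \text{if } \frac{b^\top a}{\|b\|_2} \le \frac{x_j^\top a}{\|x_j\|_2}, \\ \frac{1}{2}\Big(-x_j^\top c + \Big\|x_j - \frac{x_j^\top a}{\|a\|_2^2} a\Big\|_2 \Big\|b - \frac{b^\top a}{\|a\|_2^2} a\Big\|_2 + \frac{(a^\top b)(x_j^\top a)}{\|a\|_2^2}\Big) & \text{otherwise.} \end{cases} \qquad (10)$$

Then, for $j \in [D]$,

$$\max\{u_j^+, u_j^-\} < 1 \ \Rightarrow \ |x_j^\top \theta_2^*| < 1,$$

i.e., the $j$-th feature is non-active in the optimal solution of the LASSO with the regularization
parameter $\lambda_2$.

For the proof of this proposition, see [5].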

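Proof of Theorem 3. Remember that, for a feature indexed by $j$, a feature indexed by $j'$ represents a
feature corresponding to one of its descendant nodes, i.e., $j' \in De(j)$. To prove the theorem, using the
result of Proposition 4, it suffices to show that

$$U_j^+ < 1 \ \Rightarrow \ u_{j'}^+ < 1 \quad \text{and} \quad U_j^- < 1 \ \Rightarrow \ u_{j'}^- < 1. \qquad (11)$$

First, we prove the case with $\lambda_{t-1} = \lambda_{\max} = \max_{j \in [D]} |x_j^\top y|$. In this case,
from the primal-dual relationship of the optimal LASSO solution, the optimal dual solution at
$\lambda_{t-1}$ is written as $\theta^*_{t-1} = y/\lambda_{t-1}$. Since it means $a = 0$, from (9) and (10),

$$u_{j'}^+ = \frac{1}{2}\big(x_{j'}^\top c + \|x_{j'}\|_2 \|b\|_2\big), \qquad u_{j'}^- = \frac{1}{2}\big(-x_{j'}^\top c + \|x_{j'}\|_2 \|b\|_2\big).$$

Using the fact that

$$x_{j'}^\top c = \sum_{i: c_i > 0} c_i x^i_{j'} + \sum_{i: c_i < 0} c_i x^i_{j'} \le \sum_{i: c_i > 0} c_i x^i_{j'} \le \sum_{i: c_i > 0} c_i x^i_j \qquad (12)$$

(and, analogously, $-x_{j'}^\top c \le -\sum_{i: c_i < 0} c_i x^i_j$) and

$$\|x_{j'}\|_2 \le \|x_j\|_2,$$

we have

$$u_{j'}^+ = \frac{1}{2}\big(x_{j'}^\top c + \|x_{j'}\|_2 \|b\|_2\big) \le \frac{1}{2}\Big(\sum_{i: c_i > 0} c_i x^i_j + \|x_j\|_2 \|b\|_2\Big) = P_1, \qquad (13)$$

$$u_{j'}^- = \frac{1}{2}\big(-x_{j'}^\top c + \|x_{j'}\|_2 \|b\|_2\big) \le \frac{1}{2}\Big(-\sum_{i: c_i < 0} c_i x^i_j + \|x_j\|_2 \|b\|_2\Big) = M_1, \qquad (14)$$

which proves the theorem when $\lambda_{t-1} = \lambda_{\max}$.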

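Next, we consider $\lambda_{t-1} < \lambda_{\max}$. In this case, we cannot decide which of the two cases in
(9) and (10) applies. When the first cases of (9) and (10) apply, we can show that

$$u_{j'}^+ \le P_1 \quad \text{and} \quad u_{j'}^- \le M_1$$

in the same way as (13) and (14). On the other hand, when the second case of (9) applies, we can show that

$$u_{j'}^+ = \frac{1}{2}\Big(x_{j'}^\top c + \Big\|x_{j'} - \frac{x_{j'}^\top a}{\|a\|_2^2} a\Big\|_2 \Big\|b - \frac{b^\top a}{\|a\|_2^2} a\Big\|_2 - \frac{(a^\top b)(x_{j'}^\top a)}{\|a\|_2^2}\Big)$$

$$= \frac{1}{2}\Big(x_{j'}^\top d + \Big\|x_{j'} - \frac{x_{j'}^\top a}{\|a\|_2^2} a\Big\|_2 \Big\|b - \frac{b^\top a}{\|a\|_2^2} a\Big\|_2\Big)$$

$$\le \frac{1}{2}\Big(\sum_{i: d_i > 0} d_i x^i_j + \|x_j\|_2 \Big\|b - \frac{b^\top a}{\|a\|_2^2} a\Big\|_2\Big) = P_2,$$

where we used

$$x_{j'}^\top d = \sum_{i: d_i > 0} d_i x^i_{j'} + \sum_{i: d_i < 0} d_i x^i_{j'} \le \sum_{i: d_i > 0} d_i x^i_{j'} \le \sum_{i: d_i > 0} d_i x^i_j$$

and

$$\Big\|x_{j'} - \frac{x_{j'}^\top a}{\|a\|_2^2} a\Big\|_2 = \sqrt{\|x_{j'}\|_2^2 - \frac{(x_{j'}^\top a)^2}{\|a\|_2^2}} \le \|x_{j'}\|_2 \le \|x_j\|_2.$$

When the second case of (10) applies, we can show the following in the same way as above:

$$u_{j'}^- = \frac{1}{2}\Big(-x_{j'}^\top c + \Big\|x_{j'} - \frac{x_{j'}^\top a}{\|a\|_2^2} a\Big\|_2 \Big\|b - \frac{b^\top a}{\|a\|_2^2} a\Big\|_2 + \frac{(a^\top b)(x_{j'}^\top a)}{\|a\|_2^2}\Big)$$

$$= \frac{1}{2}\Big(-x_{j'}^\top d + \Big\|x_{j'} - \frac{x_{j'}^\top a}{\|a\|_2^2} a\Big\|_2 \Big\|b - \frac{b^\top a}{\|a\|_2^2} a\Big\|_2\Big)$$

$$\le \frac{1}{2}\Big(-\sum_{i: d_i < 0} d_i x^i_j + \|x_j\|_2 \Big\|b - \frac{b^\top a}{\|a\|_2^2} a\Big\|_2\Big) = M_2.$$

Finally, when the original covariates $z^i_j$ are binary, we can judge which of the two cases in (9) and (10)
applies by using the information available at the node $j$, and slightly tighter bounds can be obtained (the
proof of how one can make this judgment is omitted, but it can be shown in a similar manner as above). □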


B Additional Experimental Results

In this section, we show the results on several benchmark datasets. For each dataset, (a) the computation
time in seconds, (b) the number of traversed nodes, (c) the number of active features, (d) the number of
LASSO problems solved in IB, (e) the numbers of non-screened-out features and total active features, and
(f) the total computation time in seconds are plotted in the following figures with $\delta = 1.5$ (left) and
$\delta = 2.0$ (right).

B.1 Results on usps
[Figure: panels (a) to (f) for usps]

B.2 Results on madelon
[Figure: panels (a) to (f) for madelon]

B.3 Results on protein
[Figure: panels (a) to (f) for protein]

B.4 Results on mnist
[Figure: panels (a) to (f) for mnist]

B.5 Results on rcv1.binary
[Figure: panels (a) to (f) for rcv1.binary]

B.6 Results on real-sim
[Figure: panels (a) to (f) for real-sim]

B.7 Results on news20
[Figure: panels (a) to (f) for news20]